COMPUTATION-BASED RELIABILITY ANALYSIS
John F. Meyer 

1. Introduction 

Quantitative methods of analyzing system reliability 
have been recognized as an important need since the end 
of World War II. Early objects of reliability analysis 
were pieces of electronic communication and control 
equipment whose functional requirements were relatively 
easy to specify. Accordingly, what constituted the 
"success" or "failure" of such systems could be described 
in a straightforward manner and reliability measures 
such as "probability of success," "mean-time-to-failure,"
"availability," etc. were relatively easy to formulate.

During the last thirty years, however, the structural 
and functional complexity of man-made systems has increased
tremendously, particularly in the computer field,
and the reliability analysis task has likewise become much
more complex. This is especially true in the case of
fault-tolerant computing systems, where various types of
structural redundancy must be accounted for in the analysis.



Beginning with the reliability analysis of fault-tolerant
logic networks [1] and relay networks [2],
a considerable amount of effort has been devoted to
developing analytic methods for assessing the reliability
of computing systems. Recent contributions in this regard
include the analysis of fault-tolerant systems based on
architecture-level descriptions of their structure [3]-[6].
In general, these methods are based on formal models which, 
at some desired level of abstraction, represent the structure 
of the systems to be analyzed. Given a particular class 
of models, the ability to "rely on" a system is then quanti- 
fied via one or more "reliability measures" (defined on 
the model class). In defining such measures, what it means 
to "rely on" a system is usually expressed in terms of some 
underlying concept of system "success" or "failure." 

Indeed, the measure that is commonly referred to as 
"reliability" can be generally defined as "the probability 
of system success in its use environment" (see [7], for 
example). Thus, it is the meaning of success (or failure) 
that gives meaning to a reliability measure and, in turn, 
to any analysis that uses the measure. 

In the discussion that follows, we wish to focus on 
the last of these issues, namely the concept of "success" 
as it pertains to the reliability analysis of computing 
systems. In particular, we contend that success criteria 
should be "computation-based" so that they can adequately 
reflect the computational needs of the user. This is in 
contrast to "structure-based" success criteria which
can specify what is required of a system's structure, 
but can specify what constitutes successful behavior 
only as it depends on the success of the structure. In 
relatively simple computing systems where computational 
integrity is closely related to structural integrity 
(e.g., as the success of addition relates to the struc- 
tural integrity of an adder), such structure-based success 
criteria may indeed suffice. On the other hand, if the 
success of a computation depends not only on structure but 
also on such things as the state of the system prior to 
initiation of the computation, the time of initiation, and 
the input data, then structure-based criteria cannot 
express variations in success that result from such depen- 
dencies. Consequently, reliability measures that utilize 
structure-based criteria may not be indicative of a system's 
ability to successfully perform computations. 

The purpose of the discussion that follows is to 
formally establish the point we have just made. We begin 
with an example that is intended to illustrate the need 
for computation-based reliability analysis, that is, analysis 
that utilizes computation-based success criteria. We then 
develop a formal model of a computer that is just complex
enough to admit the formulation of computation-based
criteria and, in turn, reliability measures that utilize
these criteria. Finally, we apply the formal model to
illustrate how the results of a computation-based analysis
can differ from those of a structure-based analysis. 

2. The Need

To illustrate the need for computation-based
reliability analysis, consider the reliability
equations that have been developed for systems employing
modular redundancy and sparing [3], [4]. These
equations are based solely on a structural representation
(at the architecture level) of the system in question 
and, consequently, whatever the underlying success 
criteria may be, they too are structure-based. 

In particular, therefore, these criteria apply as 
well to a fault-tolerant aerospace computer as they do 
to a fault-tolerant pocket calculator. Let us examine, 
on the other hand, how the computational requirements 
for two such systems might differ. 

A computer housed in an aircraft or spacecraft 
is called on to perform a variety of functions at different 
times and for different lengths of time during the course 
of a flight (see [8], for example). A pocket calculator 
may be required only to add or multiply (any time it is 
called on to do so). What constitutes successful com- 
putation is likewise very different. In the case of an 
aircraft or spacecraft computer, success criteria will 
vary according to what function is computed and when 
the computation takes place. For example, more strict
criteria might apply to flight control computations 
during an automatic landing than to graphic display 
computations performed while en route. Consequently, 
a structural failure of a given type might cause a 
computational failure if it occurs during an automatic 
landing, but might be tolerated if it occurs before 
that time. In the case of a pocket calculator, on the 
other hand, success criteria are simple and essentially 
independent of what is computed or when computations 
take place. 

Given these differences in computational requirements, 
as they are perceived by the users of each system, 
let us examine the consequences of applying structure- 
based reliability equations of the type referred to at 
the outset. In the case of a pocket calculator, the 
reliability values determined by the equations may be 
quite meaningful to the user since, here, the success 
or failure of a computation corresponds closely to the 
success or failure of the structure (as defined by the 
structure-based success criteria of the model). In the 
case of an aircraft or spacecraft computer, on the 
other hand, the reliability values determined by the 
equations may be misleading since, in general, the success 
or failure of computations will not correspond to the 
success or failure of the computer's structure. In 
particular, as noted earlier, a structural failure might
correspond to computational failure at one time (while 
landing) and to computational success at another time 
(while en route). In other words, "success in the use 
environment" may differ considerably from the kind of 
structure-based success criteria that models of this type 
employ. 

The example just cited is indicative of the need to 
more fully account for the behavior of a computer (i.e., 
the computations it performs) when analyzing its relia- 
bility. To accomplish this, reliability measures must 
refer to concepts of system success which involve more 
than just the status of various components or subsystems. 

For systems described at the architecture level,
this need has already been acknowledged by Bouricius
et al. [5], [6] through the introduction of a parameter
called "coverage." According to their definition, coverage
is "the conditional probability that, given the existence of
a failure in the operational system, the system is able to 
recover and continue information processing with no 
permanent loss of essential information." Thus, coverage 
involves the kind of "computation-based" success criteria, 
the use of which we are advocating. 

To analytically evaluate coverage and, more generally, 
any computation-based reliability measure, the system models 
used must be capable of representing behavior (computations)
as well as structure (the computer). In this
regard, one should not be misled by the use of the coverage 
parameter in connection with purely structural models. 
Although such models can employ coverage as a parameter 
(as was done when the concept was first introduced [5]), 
they cannot be used to evaluate the parameter. The latter 
problem is the type of problem that we are concerned with 
here and one that we feel deserves further investigation. 

In the remainder of this paper, our objective is to 
establish, in quite simple and general terms, the kinds 
of things that need to be considered in developing models 
and measures for computation-based reliability analysis. 

3. Computers with Faults 

We begin by viewing a digital computer as a rather 
general type of system which, at discrete points in time, 
receives input data which, in turn, effects changes in the 
system's internal state. It will be assumed that time is
represented by the natural numbers, i.e., the time base
is the set T = {0,1,2,...}. It will be further assumed
that the state set is "coordinatized," where a subset of the
coordinates represents the values of those state variables
that are observable as output variables.

The transition structure of such a system may vary 
with time because faults (structural failures) occur or 
because the system is reconfigured in an attempt to 
recover from a fault. At a given instant of time the 
structure is fixed, however, and is described by a
transition function which determines the state of the 
computing system at time i + 1, given the state at time i 
and the input received at time i. Formalizing this 
notion, we have: 

Definition: A (formal) computer is a system

$$C = (X, Q, \Delta)$$

where

X is a nonempty set, the input set of C,

Q is a nonempty set, the state set of C,

$\Delta$ is a sequence of functions

$$\Delta = (\delta_0, \delta_1, \delta_2, \ldots)$$

where $\delta_i : Q \times X \to Q$ is the transition function of C
at time i ($i \in T$).

Thus a computer, as defined above, is a discrete-time,
time-varying system whose structure at time i is described
by the transition function $\delta_i$. In particular, if
$q \in Q$ is the state of C at time i and $a \in X$ is the input
received at time i, then $\delta_i(q,a)$ is the state of C at time
i + 1. In case structure does not vary with time,
that is,

$$\delta_{i+1} = \delta_i, \quad \text{for all } i \in T \qquad (3.1)$$

then C is time-invariant. Thus if $C = (X,Q,\Delta)$ is
time-invariant, $\Delta$ is uniquely determined by $\delta_0$ and C can
alternatively be regarded as a (state) sequential machine
with (fixed) transition function $\delta = \delta_0$.

A computer is finite-input if $|X| < \infty$ and
finite-state if $|Q| < \infty$. Note that even in case a computer
is both finite-input and finite-state, it is not finitely
specifiable unless its structure $\Delta$ is finitely specifiable.
However, in the subsequent application of this model to 
reliability analysis, all computers (both fault-free and 
faulty) of concern in the analysis will indeed be finitely 
specifiable. 

The most general view of computer behavior is that
of "string manipulation." Beginning in some initial
state $q_0$, determined by the program to be executed and
by stored data, at some initial time i, C receives an
input sequence of symbols $a_0 a_1 \cdots a_{n-1}$ where $a_j \in X$ is
interpreted as the input received at time i + j. In
response to this input sequence, there results a sequence
(trajectory) of states $q_0 q_1 \cdots q_n$ where $q_j \in Q$ is
interpreted as the state of C at time i + j. Thus the
"state behavior" of C may be viewed as a function from
$T \times X^*$ into $Q^+$ where T is the time base, $X^*$ is the set
of all finite-length sequences of input symbols (including
the null sequence $\Lambda$), and $Q^+$ is the set of all
finite-length sequences of states. More precisely, if $C = (X,Q,\Delta)$
and $q \in Q$, the state-behavior of C in q is a function
$\alpha_q : T \times X^* \to Q^+$ defined inductively as follows for all
$i \in T$:

i) $\alpha_q(i, \Lambda) = q$

ii) If $x \in X^*$ and $a \in X$, then
$\alpha_q(i, xa) = \alpha_q(i,x)\,\delta_{i+j}(q', a)$,
where j = lg(x) (the length of x)
and q' is the final state of $\alpha_q(i,x)$.

It is easy to verify that this formal notion of state-behavior
captures the intuitive notion discussed above.
Note that $\alpha_q$ maps input sequences of length n into state
trajectories of length n + 1.
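
The inductive definition of state-behavior can be sketched in code. The following is a minimal illustration only, under the assumption that the structure $\Delta$ is supplied as a function from a time j to the transition function $\delta_j$; the names `state_behavior`, `time_invariant`, and `toggle` are hypothetical and do not come from the formal development.

```python
from typing import Callable, Hashable, Sequence, Tuple

State = Hashable
Symbol = Hashable
Transition = Callable[[State, Symbol], State]
# Assumption: the structure Delta is supplied as a function j -> delta_j.
Structure = Callable[[int], Transition]


def state_behavior(structure: Structure, q: State, i: int,
                   x: Sequence[Symbol]) -> Tuple[State, ...]:
    """Return the trajectory alpha_q(i, x), a tuple of lg(x) + 1 states."""
    trajectory = [q]                       # clause (i): alpha_q(i, Lambda) = q
    for j, a in enumerate(x):              # input a is received at time i + j
        delta = structure(i + j)
        trajectory.append(delta(trajectory[-1], a))   # clause (ii)
    return tuple(trajectory)


def toggle(q: int, a: int) -> int:
    """Hypothetical one-bit transition function: next state is q XOR a."""
    return q ^ a


def time_invariant(j: int) -> Transition:
    """A time-invariant structure: delta_j = toggle for every j."""
    return toggle


print(state_behavior(time_invariant, 0, 0, [1, 0, 1]))  # (0, 1, 1, 0)
```

An input sequence of length 3 yields a trajectory of length 4, in line with the note above.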

Having established the concepts of "computer" and 
"state-behavior," we adopt a concept of "computation" 
that is somewhat more general than usually considered. 

Since computational errors may be due to erroneous initial
states, initial times, and input sequences, as well as
to faulty computers, we regard a computation as consisting
of four things: an initial state q, an initial
time i, an input sequence x and a state sequence y. More
precisely, a computation (over X and Q) is a quadruple
(q,i,x,y) where $q \in Q$, $i \in T$, $x \in X^*$ and $y \in Q^+$ such that
lg(y) = lg(x) + 1. Accordingly, q, i, x, and y are
referred to as the initial state, initial time, input
sequence and state trajectory (respectively) of the
computation. Relative to a particular computer C, a
computation of C is a computation of the form $(q, i, x, \alpha_q(i,x))$.


The fundamental question of deciding whether a computer
is a "success in its use environment" will be based on
the nature of such computations. However, even more
basic than the notion of a computational error is the concept
of a "fault," that is, a transient or permanent change
in computer structure that may, in turn, cause errors.

In terms of the concepts of a "representation scheme"
and a "system with faults" [9], the "specification class"
$\mathcal{S}$ and "realization class" $\mathcal{R}$ that we wish to consider
are the class of all computers (as defined above), that
is, both $\mathcal{S}$ and $\mathcal{R}$ are equal to the class

$$\{C \mid C \text{ is a computer}\}.$$

Moreover, we will restrict our attention to faults that
occur during the use of a computer (as opposed to faults
that occur during the design process) and so, in the
representation scheme $(\mathcal{S}, \mathcal{R}, \rho)$, $\rho$ is taken to be the
identity function. In this representation scheme, a
"computer with faults" will be defined as follows.

The "fault-free" specification, that is, the des- 
cription of the underlying system as it exists before 
any physical failures occur, will be assumed time- 
invariant (see condition (3.1)). This is not unreasonable 
since many physical systems and, in particular, most 
computing systems can be represented as time-invariant 
systems as long as there are no structural changes due
to physical failures. Suppose now that a physical failure
does occur where the failure may be transient, permanent,
or a combination of the two, that is, a permanent physical
failure that has a transient component while the permanent
change is taking place. Such physical failures can then
be represented by (formal) faults as follows. If $C = (X,Q,\Delta)$
is a computer, a fault of C (at time i) is a
triple $(\tau, \pi, i)$ where

$\tau : Q \times X \to Q$, the transient component,
$\pi : Q \times X \to Q$, the permanent component,
i is a nonnegative integer, the time of occurrence ($i \in T$).


The interpretation of $(\tau,\pi,i)$ is a physical failure
that occurs between time i and time i + 1. $\tau$ is the
transition function that the failing system exhibits while
the failure is taking place and $\pi$ is the transition
function that the system exhibits after the failure has
taken place. Thus, if $f = (\tau,\pi,i)$ is a fault of computer
$C = (X,Q,\Delta)$, the result of f is the computer
$C^f = (X,Q,\Delta^f)$ where, if $\Delta^f = (\delta^f_0, \delta^f_1, \delta^f_2, \ldots)$, then

$$\delta^f_j = \begin{cases} \delta_j & \text{if } 0 \le j < i \\ \tau & \text{if } j = i \\ \pi & \text{if } j > i. \end{cases} \qquad (3.3)$$
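
As an illustration of how the result of a single fault is formed, the sketch below builds the faulty structure from a fault-free one. It assumes, purely for illustration, that a structure is represented as a function from a time j to the transition function in force at that time; `result_of_fault`, `toggle`, and `stuck_at_one` are hypothetical names.

```python
from typing import Callable, Tuple

Transition = Callable[[int, int], int]
Structure = Callable[[int], Transition]       # assumption: j -> delta_j
Fault = Tuple[Transition, Transition, int]    # the triple (tau, pi, i)


def result_of_fault(structure: Structure, fault: Fault) -> Structure:
    """Return the faulty structure obtained from a single fault (tau, pi, i)."""
    tau, pi, i = fault

    def faulty(j: int) -> Transition:
        if j < i:
            return structure(j)   # unchanged before the failure
        if j == i:
            return tau            # while the failure is taking place
        return pi                 # after the failure has taken place

    return faulty


def toggle(q: int, a: int) -> int:
    """Hypothetical fault-free transition function (q XOR a)."""
    return q ^ a


def stuck_at_one(q: int, a: int) -> int:
    """Hypothetical transient component: the next state is forced to 1."""
    return 1


def fault_free(j: int) -> Transition:
    return toggle


# A fault at time 2 whose permanent component restores the original function.
faulty = result_of_fault(fault_free, (stuck_at_one, toggle, 2))
print(faulty(1) is toggle, faulty(2) is stuck_at_one, faulty(3) is toggle)
```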
If, in the result of $f = (\tau,\pi,i)$, there is no permanent
change in structure, that is, $\pi = \delta_{i+1}$, then f is a
transient fault (at time i). A fault $(\tau,\pi,i)$ which
represents no change whatsoever, that is, $\pi = \delta_{i+1}$ and
$\tau = \delta_i$, is referred to as a null or improper fault
(at time i). Given this notion of a fault at time i,
by a "fault" we will generally mean a sequence of faults
that represents a succession of physical failures. More
precisely, a (multiple) fault of C is a sequence

$$f = (f_{i_1}, f_{i_2}, \ldots, f_{i_k})$$

where $i_1 < i_2 < \cdots < i_k$ and $f_{i_j}$ is a fault of C at time $i_j$.
The corresponding result of f is an immediate generalization
of definition (3.3).

Given these concepts of "fault" and "result of a fault,"
we obtain the following specialization of the general
notion of a "system with faults."

Definition: A computer with faults is a triple $(C,F,\varphi)$
such that

i) $C \in \mathcal{S}$, where C is time-invariant,
ii) F is a set of faults of C, where F contains at
least one null fault,
iii) $\varphi : F \to \mathcal{R}$, where $\varphi(f) = C^f$ (the result of f).

In keeping with our earlier interpretations of these
objects, if $(C,F,\varphi)$ is a computer with faults, C will be
referred to as the fault-free computer and if f is not a
null fault, $C^f$ will be referred to as faulty.

To illustrate each of the ingredients of a computer
with faults, consider the triply modular redundant (TMR)
configuration:

[Figure: TMR configuration, three modules whose outputs feed a voter]

where each module $C_j$ is a time-invariant computer
$C_j = (X, Q, \Delta)$ with

X = Q = {0,1}
and $\Delta = (\delta, \delta, \delta, \ldots)$.

Then this (fault-free) TMR configuration is represented
by the computer:

$$C = (\{0,1\}, Q, \Delta)$$

where

$$Q = \{(q_1, q_2, q_3, q_4) \mid q_i \in \{0,1\}\}$$

with $q_1$, $q_2$ and $q_3$ representing the states of modules 1,
2 and 3 (respectively) and $q_4$ representing the value of
the voter output. The transition structure

$$\Delta = (\delta_0, \delta_1, \delta_2, \ldots)$$

is given by a fixed function $\delta = \delta_i$ for all i, where

$$\delta((q_1,q_2,q_3,q_4), a) = (q_1', q_2', q_3', \mu(q_1', q_2', q_3'))$$

with $q_i' = \delta(q_i, a)$ (the module transition function applied
to the state of module i) and $\mu$ equal to the majority function
(realized by the voter).

To illustrate the concepts of a "fault" and the
"result of a fault," suppose that at time 2 there is a
transient stuck-at-one failure at the output of module 1
and at time 4 there is a permanent stuck-at-zero
failure at the output of module 3. Then this succession
of failures is represented by the (multiple) fault

$$f = (f_2, f_4)$$

where $f_2$ is the fault at time 2 and $f_4$ is the fault at
time 4. More specifically, $f_2$ is the fault

$$(\tau_2, \pi_2, 2)$$

where (letting $q_i' = \delta(q_i, a)$):

$$\tau_2((q_1,q_2,q_3,q_4), a) = (1, q_2', q_3', \mu(1, q_2', q_3'))$$

and $\pi_2 = \delta$.

$f_4$ is the fault

$$(\tau_4, \pi_4, 4)$$

where

$$\tau_4((q_1,q_2,q_3,q_4), a) = (q_1', q_2', 0, \mu(q_1', q_2', 0))$$

and $\pi_4 = \tau_4$.

The result of the fault f is the computer $C^f = (\{0,1\}, Q, \Delta^f)$ where

$$\delta^f_j = \begin{cases} \delta & \text{if } 0 \le j < 2 \\ \tau_2 & \text{if } j = 2 \\ \pi_2 & \text{if } j = 3 \\ \tau_4 & \text{if } j = 4 \\ \pi_4 & \text{if } j > 4. \end{cases}$$

4. Tolerance Relations for Computations 

Given the class of computers with faults (over some 
specified input set X and state set Q), we now consider 
the basic issue raised at the outset of this discussion, 
namely, the formulation of computation-based success 
criteria. Although such criteria could be formally 
specified in a variety of specific ways, the following 
general formulation appears to be quite reasonable. 

We view a particular computation realized by 
some possibly faulty computer as being a "success" if 
it is "within tolerance" of the desired (error-free) 
computation. In these terms, what is regarded as a 
successful computation (in the use environment) is 
specified by a "tolerance relation" on the set of all 
possible computations. In general, such a relation can 
be formally defined as follows. 
Definition: If U is the set of all computations (over
X and Q), a tolerance relation (for computations) is a
relation $\sigma$ on U such that $\sigma$ is reflexive.

The reflexive condition of the definition says
simply that every computation is within tolerance of
itself. Accordingly the strongest tolerance relation
is the relation of equality; the weakest is the relation
$\sigma = U \times U$ where every computation is within tolerance
of every other computation. The latter says that anything 
the computer does is acceptable and therefore represents 
a theoretical extreme as opposed to a practical one. 

It should also be noted that the concept of tolerance, 
as defined above, is general enough to permit tolerable 
deviations in initial state, initial time and input as well as 
tolerable deviations in the state trajectory. Thus, for 
example, if (q,i,x,y) were the desired computation and
a delay of up to 5 time steps could be tolerated then

$$(q, i+j, x, y)\ \sigma\ (q, i, x, y) \qquad (1 \le j \le 5).$$

However, the purpose of the present investigation can be 
adequately served by examining how internal causes 
(faults) affect the state trajectory of a computation, 
and neglecting external causes that might affect initia- 
lization, timing, and input. 
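
A tolerance relation can be pictured as a reflexive Boolean predicate on pairs of computations. The sketch below is illustrative only: it assumes computations are encoded as Python tuples (q, i, x, y), and the function names `equality`, `universal`, and `delay_tolerant` are hypothetical. The delay relation admits a delay of 0 as well, which is what makes it reflexive.

```python
from typing import Tuple

# Assumption: a computation (q, i, x, y) is encoded with x and y as tuples.
Computation = Tuple[int, int, tuple, tuple]


def equality(u: Computation, v: Computation) -> bool:
    """Strongest tolerance relation: only identical computations are related."""
    return u == v


def universal(u: Computation, v: Computation) -> bool:
    """Weakest tolerance relation: every pair of computations is related."""
    return True


def delay_tolerant(u: Computation, v: Computation, max_delay: int = 5) -> bool:
    """u is within tolerance of v if it matches v but starts 0..max_delay steps late."""
    q_u, i_u, x_u, y_u = u
    q_v, i_v, x_v, y_v = v
    return (q_u, x_u, y_u) == (q_v, x_v, y_v) and 0 <= i_u - i_v <= max_delay


desired = (0, 10, (1, 0, 1), (0, 1, 1, 0))
late_by_3 = (0, 13, (1, 0, 1), (0, 1, 1, 0))
late_by_6 = (0, 16, (1, 0, 1), (0, 1, 1, 0))

print(delay_tolerant(late_by_3, desired))  # True
print(delay_tolerant(late_by_6, desired))  # False
print(equality(late_by_3, desired))        # False
```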

Given a computer with faults $(C,F,\varphi)$ and some
specified tolerance relation $\sigma$, it is now possible to define
precisely what is meant by computational success. To
this end, suppose $f \in F$, u is a computation of the
(possibly faulty) computer $C^f$ and u' is the computation
of the fault-free computer C where u' has the same
initial state, initial time, and input sequence as u.
Then, assuming no external causes of error, we can regard
u as a "success" if u is within tolerance of u'. More
precisely, if $\sigma$ is a tolerance relation and $f \in F$:

Definition: A computation u of $C^f$ is a $\sigma$-success if
$u\ \sigma\ (q, i, x, \alpha_q(i,x))$ where q, i, and x are the initial state,
initial time and input sequence of u. Otherwise u is a
$\sigma$-failure.

If a computation of $C^f$ is a $\sigma$-failure, we will say
it is caused by f. In case f can cause no $\sigma$-failures,
then f is $\sigma$-tolerated. When the tolerance relation is
understood, we will drop the reference to $\sigma$ and refer
to a computation u as simply a "success" or, in the
opposite case, a "failure."

Since state trajectories can be distinguished by a
tolerance relation, the concept of failure, as defined
above, can capture internal computational failures as
well as input-output failures. To illustrate, let C be
the TMR configuration considered earlier and suppose $\sigma$
is the relation of equality on U (i.e., $u\ \sigma\ u'$ iff u = u').
Then the fault $f = (f_2, f_4)$, considered in the earlier
example, can cause $\sigma$-failures even though f cannot cause
input-output failures (assuming the modules are properly
initialized). To be more specific, let us suppose the
module transition function $\delta$ is given by:

    (q,a)    $\delta(q,a)$
    (0,0)    0
    (0,1)    1
    (1,0)    1
    (1,1)    0

Then, for example, if q = (0,0,0,0), i = 0, and x = 101
then

$$\alpha^f_q(i,x) = (0,0,0,0)(1,1,1,1)(1,1,1,1)(1,0,0,0)$$

since the transition function at time 2 is $\tau_2$. On
the other hand

$$\alpha_q(i,x) = (0,0,0,0)(1,1,1,1)(1,1,1,1)(0,0,0,0).$$

Thus the computations $u = (q,i,x,\alpha^f_q(i,x))$ and
$u' = (q,i,x,\alpha_q(i,x))$ are not equal, that is, $u \mathrel{\not\sigma} u'$, and hence u
is a $\sigma$-failure.

To continue the example, suppose $\sigma'$ is a second
tolerance relation which requires only that values on
the output line (coordinate 4) be what they should be.
More precisely, $(q,i,x,y)\ \sigma'\ (q,i,x,y')$ if y and y' have
the same length, say n, and the j-th state of y has
the same 4th coordinate as the j-th state of y', j =
0,1,...,n-1. Given that $\sigma'$ is the tolerance relation
of interest, it can be shown that the fault $f = (f_2, f_4)$
does not cause any $\sigma'$-failures (provided all module states
are the same when the computation begins). In other
words, although f can cause internal failures (according
to tolerance relation $\sigma$), it can cause no input-output
failures (according to tolerance relation $\sigma'$).
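
The single computation worked above (q = (0,0,0,0), i = 0, x = 101) can be replayed in a few lines. This is a sketch under stated assumptions, not a verification of the general claim: it checks only that one computation, it takes the permanent component of the fault at time 4 to be the same stuck-at-zero function as its transient component, and the names `trajectory`, `sigma`, and `sigma_prime` are hypothetical.

```python
def module_delta(q, a):
    """Module transition function from the table in the text (q XOR a)."""
    return q ^ a


def majority(a, b, c):
    """Voter: majority of three bits."""
    return 1 if a + b + c >= 2 else 0


def delta(state, a):
    """Fault-free TMR transition function."""
    q1, q2, q3 = (module_delta(q, a) for q in state[:3])
    return (q1, q2, q3, majority(q1, q2, q3))


def tau2(state, a):
    """Stuck-at-one at the output of module 1."""
    q2, q3 = module_delta(state[1], a), module_delta(state[2], a)
    return (1, q2, q3, majority(1, q2, q3))


def tau4(state, a):
    """Stuck-at-zero at the output of module 3."""
    q1, q2 = module_delta(state[0], a), module_delta(state[1], a)
    return (q1, q2, 0, majority(q1, q2, 0))


def fault_free(j):
    return delta


def faulty(j):
    """Structure resulting from the fault (f_2, f_4)."""
    if j == 2:
        return tau2
    if j >= 4:
        return tau4      # assumption: the permanent component equals tau4
    return delta         # j < 2, and j = 3 (transient fault has passed)


def trajectory(structure, q, i, x):
    """State trajectory from state q at time i under input sequence x."""
    states = [q]
    for j, a in enumerate(x):
        states.append(structure(i + j)(states[-1], a))
    return tuple(states)


def sigma(y, y_ref):
    """Equality of state trajectories."""
    return y == y_ref


def sigma_prime(y, y_ref):
    """Agreement on the output line (coordinate 4) only."""
    return len(y) == len(y_ref) and all(s[3] == r[3] for s, r in zip(y, y_ref))


q, i, x = (0, 0, 0, 0), 0, (1, 0, 1)
y_faulty = trajectory(faulty, q, i, x)
y_ref = trajectory(fault_free, q, i, x)
print(y_faulty)   # ((0, 0, 0, 0), (1, 1, 1, 1), (1, 1, 1, 1), (1, 0, 0, 0))
print(y_ref)      # ((0, 0, 0, 0), (1, 1, 1, 1), (1, 1, 1, 1), (0, 0, 0, 0))
print(sigma(y_faulty, y_ref), sigma_prime(y_faulty, y_ref))  # False True
```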

5. Reliability Measures 

In general, a "reliability measure" is a function
from some class of systems into some set of numbers (or
product set of numbers) whose value, for a given system,
reflects the ability to rely on that system in some
specified use environment. When viewed in this way,
the concept of a reliability measure includes such
measures as "mean-time-to-failure," "availability,"
"recoverability," etc., as well as the measure "probability
of success (in the use environment)." What we
wish to examine now is how such reliability measures 
might be formulated in terms of the computation-based 
success criteria developed in the previous section. 

The investigation will focus on the measure "probability 
of success," although other measures of the type men- 
tioned above could be dealt with in a similar fashion. 
In general, to formulate the measure "probability
of success" where, as earlier, success means "success in
the use environment," the probabilistic nature of both
the system and the environment must be taken into account.
In classical structure-based formulations, the environment
is usually described by a single parameter t (the duration
of time that the system is utilized) and assumed to be
deterministic (i.e., it is assumed that t has a known
fixed value when evaluating probability of success). 
However, when other aspects of the environment are con- 
sidered, such as the computational requirements of the 
user, it is more realistic to regard the environment 
as probabilistic. 

To formalize this view, if $(C,F,\varphi)$ is a computer
with faults where $C = (X,Q,\Delta)$, the environment of C
can be represented by a probability space

$$(E, \mathcal{E}, P_E) \qquad (5.1)$$

where

$E = Q \times T \times X^*$,

$\mathcal{E} = \{E' \mid E' \subseteq E\}$ (the "events" on E),

$P_E : \mathcal{E} \to [0,1]$ is a probability measure.

Here, an element (q,i,x) in the sample space E describes
an environment wherein the computer is to realize a
computation with initial state q, initial time i, and
input sequence x. The interpretation of the probability
measure is the usual one, that is, if $E' \in \mathcal{E}$ then:

$P_E(E')$ = the probability that the (experienced)
environment is in the event E'.

As for the probabilistic nature of the faults of C,
it can be represented by a second probability space

$$(F, \mathcal{F}, P_F) \qquad (5.2)$$

where F is the set of faults of C,

$\mathcal{F} = \{F' \mid F' \subseteq F\}$,

and $P_F : \mathcal{F} \to [0,1]$ is a probability measure.

Again the interpretation of $P_F$ is the usual one, that is,
if $F' \in \mathcal{F}$ then:

$P_F(F')$ = the probability that the (experienced)
fault is in the event F'.

Given the spaces $(E, \mathcal{E}, P_E)$ and $(F, \mathcal{F}, P_F)$, the
probabilistic nature of both the environment and the faults
of C can then be represented by a single space

$$(G, \mathcal{G}, P)$$

where $G = E \times F$,

$\mathcal{G} = \{G' \mid G' \subseteq G\}$,

and $P : \mathcal{G} \to [0,1]$ is the (unique) probability
measure that satisfies the condition:

$$P(\{(e,f)\}) = P_E(\{e\}) \cdot P_F(\{f\}), \quad \text{for all } (e,f) \in G \qquad (5.3)$$

Note that Eq. (5.3) expresses an underlying assumption
that environmental events are independent of faults,
which we feel is quite reasonable. If $G' \in \mathcal{G}$, the
interpretation of P(G') is the probability that the
(experienced) environment and fault have descriptions
e and f, respectively, such that $(e,f) \in G'$.
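
The product construction can be illustrated on a toy example. The sketch below replaces the (generally infinite) sample spaces by small finite stand-ins; the outcome labels and probability values are invented for illustration, and `P` is a hypothetical helper that sums the point masses of the product measure over an event.

```python
from fractions import Fraction
from itertools import product

# Hypothetical finite stand-ins for the environment and fault spaces.
P_E = {"e1": Fraction(1, 4), "e2": Fraction(3, 4)}
P_F = {"null": Fraction(9, 10), "f1": Fraction(1, 10)}


def P(event):
    """Probability of a set of (e, f) pairs under the product measure."""
    return sum(P_E[e] * P_F[f] for (e, f) in event)


G = set(product(P_E, P_F))                 # sample space E x F
only_e1 = {(e, f) for (e, f) in G if e == "e1"}

print(P(G))                    # 1
print(P({("e2", "f1")}))       # 3/40
print(P(only_e1) == P_E["e1"])  # True: marginal on the environment is P_E
```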

A probabilistic framework has now been established
for a formal definition of "probability of success in the
use environment" or what we will refer to simply as
"reliability."

Definition: If $(C,F,\varphi)$ is a computer with faults,
$\sigma$ is a tolerance relation on the computations of C, and
P is the probability measure defined by Eq. (5.3), then
the reliability of C (denoted $R_\sigma(C)$) is the probability

$$R_\sigma(C) = P(H)$$

where

$$H = \{(q,i,x,f) \mid \text{the computation } (q,i,x,\alpha^f_q(i,x)) \text{ is a } \sigma\text{-success}\}.$$

Note that $R_\sigma$ may be viewed as a reliability measure (from
computers with faults into the real interval [0,1])
where the value of $R_\sigma$ for computer C is $R_\sigma(C)$. Thus the
above definition yields a whole class of reliability measures
that differ according to the choice of a tolerance
relation $\sigma$.
6. Analysis of a Read-only Memory

To illustrate the application of computation-based
reliability measures, let us suppose the system to be
analyzed is a 1024-word, 32-bit/word read-only memory
(ROM). Then the ROM (before the occurrence of any
physical failures) can be represented by the fault-free
computer $C = (X,Q,\Delta)$ where

$X = \{0,1\}^{10}$,

$Q = X \times Y$ where $Y = \{0,1\}^{32}$,
and $\Delta = (\delta, \delta, \delta, \ldots)$.

To describe the (fault-free) transition function $\delta$, with
each "address" $a \in X$ we associate a word $c(a) \in Y$,
the "content of a." Then for all $q \in Q$,

$$\delta(q, a) = (a, c(a)).$$

Suppose further that the physical failures of concern
are memory cell failures that permanently alter the content
of an address. Then, for some specific address b, such
failures can be formally represented by (single) faults
of the form

$$f(b,i) = (\tau, \pi, i)$$

where $\tau = \pi$ (i.e., f(b,i) is a permanent fault) and

$$\pi(q,a) = \begin{cases} \delta(q,a) & \text{if } a \ne b \\ (a,c), \text{ where } c \ne c(a), & \text{otherwise.} \end{cases}$$
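
A small sketch of the ROM model and of the permanent component of a single fault follows. It assumes, for illustration only, that addresses are encoded as integers 0..1023 and words as 32-bit integers, and that the contents are arbitrary pseudo-random values; `content`, `make_pi`, and the chosen address are hypothetical.

```python
import random

ADDRESS_BITS, WORD_BITS = 10, 32

# Hypothetical contents c(a): one fixed pseudo-random 32-bit word per address.
_rng = random.Random(0)
content = [_rng.getrandbits(WORD_BITS) for _ in range(2 ** ADDRESS_BITS)]


def delta(q, a):
    """Fault-free ROM: the next state is (address, content of that address)."""
    return (a, content[a])


def make_pi(b, c):
    """Permanent component of a fault at address b, which now reads c."""
    if c == content[b]:
        raise ValueError("a fault must change the content of address b")

    def pi(q, a):
        return delta(q, a) if a != b else (a, c)

    return pi


b = 5
pi = make_pi(b, content[b] ^ 1)     # flip the lowest bit of the word at b
state = (0, content[0])             # any state; the ROM ignores it

print(pi(state, 4) == delta(state, 4))   # True: other addresses unaffected
print(pi(state, b) == delta(state, b))   # False: address b reads a wrong word
```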
If null faults of the form $f = (\delta, \delta, i)$ are also included,
then the fault set F is the set of all sequences of
single faults, i.e., $f \in F$ if and only if, for some
$m \ge 1$,

$$f = (f(b_1,i_1), f(b_2,i_2), \ldots, f(b_m,i_m))$$

where $i_1 < i_2 < \cdots < i_m$. As for the underlying tolerance
relation $\sigma$, we assume that no readout errors can be
tolerated (i.e., there can be no errors in the Y coordinate
values of a state). As no failures are postulated
for the addressing structure (which determines the X
coordinate values), we can therefore take $\sigma$ to be the
relation of equality on the computations of C.

Regarding the environment of the system, let us
suppose the ROM is part of an aircraft computer where it
receives slowly changing address updates at the rate
of 1 per minute. (Time i will be interpreted as the i-th
minute.) Let us suppose further that as inputs change,
the likelihood of repeating a given address is negligible.
Then, for a mission duration of t minutes (where $t \le 1024$),
the environment of the ROM is described by a
probability space $(E, \mathcal{E}, P_E)$, where E and $\mathcal{E}$ are as
defined in (5.1), and $P_E$ is subject to the condition
that, whenever

$$P_E(\{(q, i, a_0 a_1 \cdots a_{\ell-1})\}) > 0$$

then i = 0, $\ell = t$, and $a_j \ne a_k$ if $j \ne k$.

^ J K 

The probabilistic nature of the faults $f \in F$ is
determined by that of the physical memory cells.
Assuming that a single cell failure changes the contents
of that cell, a fault f(b,i) corresponds to no failures
in each of the 32 cells (addressed by b) until time i
and then at least one cell failure (among the 32) between
time i and time i + 1. Thus, if cell failures occur
at constant hazard rate $\lambda$, the value of $P_F$ (see (5.2))
for a fault f(b,i) is given by:

$$P_F(\{f(b,i)\}) = e^{-32\lambda i}(1 - e^{-32\lambda}) = e^{-32\lambda i} - e^{-32\lambda(i+1)}.$$

The probability of a sequence of single faults can be
formulated in a similar manner.
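
The single-fault probability is easy to evaluate numerically. The sketch below is illustrative: the hazard rate value is invented, and `p_single_fault` is a hypothetical helper. It also checks that the two forms of the expression agree and that summing over the first n time steps telescopes to the probability of at least one failure in the word by time n.

```python
import math

CELLS_PER_WORD = 32


def p_single_fault(lam: float, i: int) -> float:
    """Probability of no failure among the 32 cells up to time i,
    followed by at least one failure between time i and time i + 1."""
    survive_until_i = math.exp(-CELLS_PER_WORD * lam * i)
    fail_in_next_step = 1.0 - math.exp(-CELLS_PER_WORD * lam)
    return survive_until_i * fail_in_next_step


lam = 1e-3   # hypothetical hazard rate per cell per minute
n = 100

difference_form = math.exp(-32 * lam * 3) - math.exp(-32 * lam * 4)
total = sum(p_single_fault(lam, i) for i in range(n))

print(math.isclose(p_single_fault(lam, 3), difference_form, rel_tol=1e-9))
print(math.isclose(total, 1.0 - math.exp(-32 * lam * n), rel_tol=1e-9))
```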

We now have enough information to determine the
reliability of C according to the measure $R_\alpha$. By
definition, $R_\alpha(C)$ is the probability P(H) of the event
H consisting of all tuples (q,i,x,f) such that the
computation $(q, i, x, \sigma_q^f(i,x))$ is an $\alpha$-success, that is,
$\sigma_q^f(i,x) = \sigma_q(i,x)$. We note first that $P(\{(q,i,x,\sigma_q^f(i,x))\})$
will be 0 if $i \neq 0$, $\mathrm{lg}(x) \neq t$, or the sequence x has
repeated addresses (since $P_E(\{(q,i,x)\}) = 0$ under these
conditions). Thus we need only consider tuples (q,i,x,f)
such that $i = 0$, $x = a_0 a_1 \ldots a_{t-1}$, and $a_j \neq a_k$ if $j \neq k$.

In this case, it follows that the computation
$u = (q, 0, x, \sigma_q^f(0,x))$ is an $\alpha$-success if and only if, for each
time i ($0 \leq i \leq t-1$) and for all j such that $0 \leq j \leq i$,
the fault $f(a_i, j)$ does not occur in the sequence f.
This will ensure that, for each time i, the (i+1)-th
state of $\sigma_q^f(0,x)$ is equal to the (i+1)-th state of
$\sigma_q(0,x)$. Moreover, since $a_i$ occurs exactly once in the
sequence x, no fault $f(a_i, j)$ with $j > i$ will cause
an $\alpha$-failure at a later time. Thus, if we let

    $F_x = \{f \mid (q, 0, x, \sigma_q^f(0,x)) \text{ is an } \alpha\text{-success}\}$,

it follows that

    $P_F(F_x) = \prod_{i=1}^{t} e^{-32\lambda i} = e^{-32\lambda t(t+1)/2}$.


Summing over the environment

    $E' = \{(q,0,x) \mid P_E(\{(q,0,x)\}) > 0\}$

we have:

    $P(H) = \sum_{(q,0,x) \in E'} P_E(\{(q,0,x)\}) \cdot P_F(F_x)$

         $= \sum_{(q,0,x) \in E'} P_E(\{(q,0,x)\}) \cdot e^{-32\lambda t(t+1)/2}$

         $= 1 \cdot e^{-32\lambda t(t+1)/2}$.

Hence

    $R_\alpha(C) = e^{-32\lambda t(t+1)/2}$.                    (6.1)
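The result (6.1) can be checked empirically on a small instance. The Python sketch below simulates the success criterion used in the derivation: the read at time i succeeds when the addressed word has had no cell failure before time i + 1. Several details are assumptions of the sketch rather than statements from the paper. Each word's first failure time is taken to be exponential with rate 32λ, words fail independently, and the address sequence is drawn uniformly from the sequences of t distinct addresses. The closed form does not depend on which admissible sequence occurs, so the choice of distribution should not affect the estimate. All function names are hypothetical.

```python
import math
import random


def closed_form(lam, t, bits=32):
    """Eq. (6.1): exp(-bits * lam * t * (t + 1) / 2)."""
    return math.exp(-bits * lam * t * (t + 1) / 2.0)


def simulate(lam, t, num_words, trials, seed=0, bits=32):
    """Estimate the probability of a successful computation by simulation.

    Assumptions of this sketch: exponential first-failure time per word
    (rate bits * lam), independent words, and a uniformly random sequence
    of t distinct addresses.
    """
    rng = random.Random(seed)
    word_rate = bits * lam
    successes = 0
    for _ in range(trials):
        # First-failure time of every word in the (small) ROM.
        fail_time = [rng.expovariate(word_rate) for _ in range(num_words)]
        # Address read at minute i is x[i]; no address repeats.
        x = rng.sample(range(num_words), t)
        # The read at time i succeeds if word x[i] has not failed before i + 1.
        if all(fail_time[x[i]] >= i + 1 for i in range(t)):
            successes += 1
    return successes / trials
```

A call such as `simulate(0.001, 5, 8, 20000)` can be compared with `closed_form(0.001, 5)`. The two should agree to within sampling error, which is on the order of a few thousandths for this trial count.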


The above example is intended to illustrate the many 
concepts introduced in previous sections and, for this 
reason, the development has been somewhat more lengthy 
than what would normally be required to achieve the end 
result. The example also serves to illustrate how a 
computation-based reliability measure can differ from 
a structure-based measure. In particular, suppose we 
consider the usual structure-based success criterion for
the example in question, that is, "no memory cell failures 
during the utilization interval t." Since the ROM 
has 1024 words with 32 bits/word , the reliability 
(probability of success) in this case is given by 


    $R(C) = e^{-(1024) 32 \lambda t}$.                    (6.2)


Comparing Eqs. (6.1) and (6.2), we note first that 
the structure-based formula says that the system failure 
rate is constant, while the computation-based formula 
does not. Second, we note that, even in the case of 
maximum utilization for the environmental assumptions 
of (6.1) (i.e., t = 1024) wherein all addresses are
interrogated, the effective failure rate given by the
computation-based measure is only about half that of the
structure-based measure. To obtain a more concrete
comparison, suppose that the memory cell hazard rate
is $10^{-7}$ failures per hour and the utilization interval
is 10 hours (which might be required of a long-range
aircraft). Then $\lambda = 10^{-7}/60$, t = 600 and substituting
in Eq. (6.1):

    $R_\alpha(C) = e^{-9.616 \cdot 10^{-3}} = .9904$.

On the other hand, substituting these same values in
(6.2):

    $R(C) = e^{-3.2768 \cdot 10^{-2}} = .9678$.

Thus the computation-based measure yields a considerably 
higher estimate of the ROM's reliability (in this use 
environment) than does the structure-based measure. 
Judging by other examples we have looked at, this kind 
of difference is typical. In other words, structure- 
based measures will often yield a more pessimistic view 
of a computer's reliability than is warranted by the 
computational needs of the user. 
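The comparison between (6.1) and (6.2) can be reproduced with a few lines of Python. This is a sketch under the example's stated parameters (1024 words, 32 bits per word, a hazard rate of $10^{-7}$ per hour converted to a per-minute rate, and t = 600 minutes). The function names are illustrative assumptions.

```python
import math

WORDS = 1024  # ROM words
BITS = 32     # bits per word


def r_computation(lam, t):
    """Eq. (6.1): computation-based reliability for t minutes of use."""
    return math.exp(-BITS * lam * t * (t + 1) / 2.0)


def r_structure(lam, t):
    """Eq. (6.2): probability of no cell failure anywhere in the ROM."""
    return math.exp(-WORDS * BITS * lam * t)


def exponent_ratio(t):
    """Ratio of the exponent in (6.1) to that in (6.2); lam and t cancel
    down to (t + 1) / (2 * WORDS)."""
    return (t + 1) / (2.0 * WORDS)


lam = 1e-7 / 60.0   # failures per cell per minute
t = 600             # a 10-hour mission, in minutes
print(round(r_computation(lam, t), 4), round(r_structure(lam, t), 4))
print(round(exponent_ratio(1024), 4))
```

The printed reliabilities agree with the figures quoted above to about three decimal places. The exponent ratio at t = 1024 comes out slightly above one half, which is the sense in which the effective failure rate is halved at maximum utilization.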
7. Conclusion

The purpose of this investigation has been to give 
a precise meaning to the notion of a computation-based 
reliability analysis in terms of a simple but general 
model of a computer with faults. This is not to suggest 
that the proposed model is the only one that could be 
or should be used in the analysis of a specific class 
of computing systems. The investigation does indicate, 
however, the kinds of things that should be considered 
if reliability measures are to more accurately reflect 
computational needs of the user. It is hoped that this 
will provide a framework for more detailed investigations 
regarding the feasibility of computation-based analysis
methods.